Lecture 1: Review of linear regression
All models are wrong, but some are useful.
- George Box
We’ll focus on a handful of variables:
Sale_Price - Sale price of the house / $10K (response variable)
Gr_Liv_Area - Above grade (ground) living area in square feet
Overall_Qual - Rates the overall material and finish of the house

Is Sale_Price = $160K? Sale_Price is not a constant, but a random variable whose value varies from home to home (and year to year, etc.). We're more interested in questions such as:
Sale_Price > $160K? (above the median sale price)
Sale_Price < $105K? (lowest decile)
Sale_Price < $213,500? (within the IQR)

We can look at a histogram and empirical CDF of Sale_Price:
Is Sale_Price \(\sim N\left(\mu, \sigma^2\right)\)? A normal quantile-quantile (Q-Q) plot can be used to assess the "normality" of a set of observations
Q-Q plots can, in general, be used to compare data with any distribution!
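For instance, a Q-Q plot against a non-normal reference only requires computing the corresponding theoretical quantiles. A minimal sketch with simulated data (not the ames set), using an exponential reference distribution:

```r
# Q-Q plot of simulated data against an Exp(2) reference distribution
set.seed(101)
x <- rexp(200, rate = 2)       # simulated (non-normal) data
probs <- ppoints(length(x))    # plotting positions
theo <- qexp(probs, rate = 2)  # theoretical quantiles of the reference
qqplot(theo, x, xlab = "Theoretical quantiles (Exp(2))",
       ylab = "Sample quantiles")
abline(0, 1, lty = 2)  # points near this line suggest a good fit
```

Swapping `qexp()` for any other quantile function (`qgamma()`, `qt()`, etc.) compares the data against that distribution instead.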
Normality tests, like the Shapiro-Wilk and Anderson-Darling tests, can also be used to assess normality
No data are exactly normally distributed; what we care about is whether a normal approximation is close enough!
Normality tests provide a \(p\)-value, which only gives a yes/no conclusion
Recall that \(p\)-values are a function of sample size!
# Shapiro-Wilk test results vs. sample size
set.seed(101)  # for reproducibility
x <- replicate(100, c(
  shapiro.test(rt(10, df = 40))$p.value,
  shapiro.test(rt(100, df = 40))$p.value,
  shapiro.test(rt(500, df = 40))$p.value,
  shapiro.test(rt(1000, df = 40))$p.value,
  shapiro.test(rt(2500, df = 40))$p.value,
  shapiro.test(rt(5000, df = 40))$p.value
))
rownames(x) <- paste0("n=", c(10, 100, 500, 1000, 2500, 5000))
rowMeans(x < 0.05)

  n=10  n=100  n=500 n=1000 n=2500 n=5000
  0.01   0.08   0.11   0.22   0.24   0.39
Are these two distributions significantly different?
Are these two distributions practically different?
Try transformations
Try a more appropriate distribution (e.g., Poisson or gamma distribution)
Try more advanced nonparametric approaches, like bootstrapping!
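As a minimal bootstrap sketch (simulated skewed data standing in for a real response), a percentile interval for the mean requires no normality assumption at all:

```r
# Nonparametric bootstrap CI for a mean with clearly non-normal data
set.seed(101)
x <- rexp(50, rate = 1 / 20)  # simulated right-skewed "prices", mean ~20
boot_means <- replicate(5000, mean(sample(x, replace = TRUE)))
boot_ci <- quantile(boot_means, c(0.025, 0.975))  # percentile interval
boot_ci
```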
\[\mu = \mu\left(x\right) = \beta_0 + \beta_1 x = E\left(Y|x\right)\]
Idea of LS is to find \(\beta_0\) and \(\beta_1\) so that the sum of squared residuals (i.e., errors) is minimized: \[SSE = \sum_{i=1}^n\left(y_i - \beta_0 - \beta_1x_i\right)^2\]
Pretty straightforward optimization problem that leads to closed-form solution (but no point in memorizing the formulas!)
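For intuition only, the closed-form solution can be computed directly from the design matrix; simulated data are used here, and note that lm() solves the same problem more stably via a QR decomposition:

```r
# Closed-form LS estimate: beta_hat = (X'X)^{-1} X'y
set.seed(101)
x <- runif(100, 1, 10)
y <- 2 + 3 * x + rnorm(100)
X <- cbind(1, x)                                  # design matrix (with intercept)
beta_hat <- solve(crossprod(X), crossprod(X, y))  # solves (X'X) b = X'y
drop(beta_hat)
coef(lm(y ~ x))  # same estimates
```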
Sale_Price and Gr_Liv_Area
Call:
lm(formula = Sale_Price ~ Gr_Liv_Area, data = ames)
Residuals:
Min 1Q Median 3Q Max
-48.347 -3.022 -0.197 2.273 33.432
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.3289634 0.3269703 4.064 4.94e-05 ***
Gr_Liv_Area 0.0111694 0.0002066 54.061 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.652 on 2928 degrees of freedom
Multiple R-squared: 0.4995, Adjusted R-squared: 0.4994
F-statistic: 2923 on 1 and 2928 DF, p-value: < 2.2e-16
Which assumptions seem violated to some degree?
Residuals vs. predictor values (checking non-linearity)
Residuals vs. fitted values (checking non-constant variance, non-linearity, and outliers)
Residuals vs. time or another sequence (checking independence)
Residuals vs. omitted predictor values (checking for potentially important missing predictors)
Normal Q-Q plot of residuals (checking non-normality)
And much, much more!
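Several of the checks above come built in via base R's plot() method for lm objects; a sketch with simulated data standing in for ames:

```r
# Standard residual diagnostics for a fitted linear model
set.seed(101)
d <- data.frame(x = runif(200, 500, 4000))
d$y <- 1 + 0.01 * d$x + rnorm(200, sd = 3)
fit <- lm(y ~ x, data = d)

par(mfrow = c(2, 2))
plot(fit)  # residuals vs. fitted, normal Q-Q, scale-location, leverage
par(mfrow = c(1, 1))

plot(d$x, residuals(fit))  # residuals vs. predictor (non-linearity check)
abline(h = 0, lty = 2)
```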
Sale_Price ~ Gr_Liv_Area: what assumptions appear to be in violation?
Call:
lm(formula = log(Sale_Price) ~ Gr_Liv_Area, data = ames)
Residuals:
Min 1Q Median 3Q Max
-2.36215 -0.15145 0.03091 0.16583 0.90332
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.9692014 0.0169355 116.28 <2e-16 ***
Gr_Liv_Area 0.0005611 0.0000107 52.43 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2928 on 2928 degrees of freedom
Multiple R-squared: 0.4842, Adjusted R-squared: 0.484
F-statistic: 2749 on 1 and 2928 DF, p-value: < 2.2e-16
log(Sale_Price) ~ Gr_Liv_Area: any better?
The (normal) multiple linear regression model assumes \(Y \sim N\left(\mu\left(\boldsymbol{x}\right), \sigma^2\right)\), where \[\mu\left(\boldsymbol{x}\right) = \beta_0 + \sum_{i=1}^p \beta_i x_i = \boldsymbol{x}^\top\boldsymbol{\beta}\]
LS estimation still provides unbiased estimate of \(\boldsymbol{\beta} = \left(\beta_0, \beta_1, \dots, \beta_p\right)^\top\): \(\hat{\boldsymbol{\beta}} = \left(\boldsymbol{X}^\top\boldsymbol{X}\right)^{-1}\boldsymbol{X}^\top\boldsymbol{y}\)
Fitted values: \(\hat{\boldsymbol{y}} = \boldsymbol{X}\left(\boldsymbol{X}^\top\boldsymbol{X}\right)^{-1}\boldsymbol{X}^\top\boldsymbol{y} = \boldsymbol{H}\boldsymbol{y}\)
\(\boldsymbol{H}\) is the well-known “hat matrix”
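A quick numerical sketch of the hat matrix with a simulated design matrix (forming \(\boldsymbol{H}\) explicitly is for intuition only; it needs \(O(n^2)\) storage, and lm() never does this):

```r
# Fitted values via the hat matrix H = X (X'X)^{-1} X'
set.seed(101)
n <- 50
X <- cbind(1, runif(n, 1, 10), rnorm(n))  # design matrix: intercept + 2 predictors
y <- drop(X %*% c(2, 3, -1)) + rnorm(n)

H <- X %*% solve(crossprod(X)) %*% t(X)  # the hat matrix
y_hat <- H %*% y                         # H "puts a hat on" y
all.equal(drop(y_hat), unname(fitted(lm(y ~ X - 1))))  # TRUE
sum(diag(H))  # trace(H) = number of estimated coefficients (here 3)
```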
Categorical variables can be handled in a number of ways in linear models, including dummy (indicator) encoding.
Let’s look at two (nominal) categorical variables (in your homework, you’ll look at Overall_Qual, an ordered factor with many levels):
If one of these homes downgraded from a paved driveway to a gravel driveway, would that cause the sale price to decrease? (Think very carefully here!)
R dummy encodes nominal factors by default:
Call:
lm(formula = log(Sale_Price) ~ Gr_Liv_Area + Central_Air + Paved_Drive,
data = ames)
Residuals:
Min 1Q Median 3Q Max
-2.2344 -0.1345 0.0073 0.1453 0.8502
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.428e+00 2.418e-02 59.059 < 2e-16 ***
Gr_Liv_Area 5.183e-04 9.520e-06 54.440 < 2e-16 ***
Central_AirY 3.464e-01 2.068e-02 16.747 < 2e-16 ***
Paved_DrivePartial_Pavement 1.334e-01 3.733e-02 3.574 0.000358 ***
Paved_DrivePaved 3.085e-01 1.975e-02 15.618 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2574 on 2925 degrees of freedom
Multiple R-squared: 0.6018, Adjusted R-squared: 0.6013
F-statistic: 1105 on 4 and 2925 DF, p-value: < 2.2e-16
How do you interpret the coefficients here?
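R's default dummy (treatment) encoding can be inspected directly with model.matrix(); the toy factor below mimics the Paved_Drive levels from the output above:

```r
# Inspect R's default dummy (treatment) encoding for a nominal factor
drive <- factor(c("Dirt_Gravel", "Partial_Pavement", "Paved", "Paved"))
m <- model.matrix(~ drive)
m
# The first level ("Dirt_Gravel", alphabetically) is absorbed into the
# intercept as the reference; each remaining level gets a 0/1 indicator.
```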
If prediction is the goal (as opposed to inference and hypothesis testing), then model performance should be assessed rigorously
Data splitting techniques are key, and the type of data splitting to use often depends on the situation (e.g., cross-sectional vs. time-series data)
In simplest terms, randomly split the data into two parts: A and B. Build a model on part A and see how well it predicts the response in part B.
Leakage is a huge concern here, so data splitting has to be done carefully!
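A minimal sketch of a random 70/30 split for cross-sectional data (simulated data standing in for ames; for time-series data you would split by time instead):

```r
# Random train/test split and honest test-set error
set.seed(101)
d <- data.frame(x = runif(500, 500, 4000))
d$y <- 1 + 0.01 * d$x + rnorm(500, sd = 3)

trn_id <- sample(nrow(d), size = floor(0.7 * nrow(d)))  # 70% for training
trn <- d[trn_id, ]   # part A: fit the model here
tst <- d[-trn_id, ]  # part B: assess predictions here

fit <- lm(y ~ x, data = trn)
rmse <- sqrt(mean((tst$y - predict(fit, newdata = tst))^2))
rmse  # test-set RMSE; compare across competing models
```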
The PRESS statistic in linear regression is the special case \(k = n\) (i.e., leave-one-out cross-validation), and we get it for free!
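"For free" because, for a linear model, the leave-one-out residual is just \(e_i / (1 - h_i)\), where \(h_i\) is the \(i\)th diagonal of the hat matrix. A sketch with simulated data, verified against brute-force refitting:

```r
# PRESS (leave-one-out CV) from the hat values -- no refitting required
set.seed(101)
d <- data.frame(x = runif(100, 1, 10))
d$y <- 2 + 3 * d$x + rnorm(100)
fit <- lm(y ~ x, data = d)

press <- sum((residuals(fit) / (1 - hatvalues(fit)))^2)

# Brute-force check: actually refit n times, leaving one row out each time
loo <- sapply(seq_len(nrow(d)), function(i) {
  f <- lm(y ~ x, data = d[-i, ])
  d$y[i] - predict(f, newdata = d[i, ])
})
all.equal(press, sum(loo^2))  # TRUE
```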